Abstract

We introduce a new criterion to select in a consistent way the probabilistic
context tree generating a sample. The basic idea is to construct a totally
ordered set of candidate trees. This set is composed of the "champion trees",
the ones that maximize the likelihood of the sample for each number of degrees
of freedom. The smallest maximizer criterion selects the infimum of the subset
of champion trees whose gain in likelihood is negligible. In addition, we
propose a new algorithm based on resampling to implement this criterion. This
study was motivated by the linguistic challenge of retrieving rhythmic features
from written texts. Applied to a data set consisting of texts extracted from
daily newspapers, our algorithm identifies different context trees for European
Portuguese and Brazilian Portuguese. This is compatible with the long-standing
conjecture that European Portuguese and Brazilian Portuguese belong to
different rhythmic classes. Moreover, these context trees have several
interesting properties which are linguistically meaningful.



1 Introduction 

This paper has three main contributions. First of all, we introduce the 
smallest maximizer criterion which selects in a consistent way the probabilistic 
context tree generating a sample. This is a constant-free approach to the
problem of probabilistic context tree selection. We also propose an algorithm 
to implement this criterion and effectively identify a probabilistic context tree 
from a finite sample. Finally, we apply this procedure to address the challenging 
linguistic question of how to retrieve rhythmic features from written texts, 
which was at the origin of this paper. 
The search for rhythmic signatures in written texts is important from
different scientific points of view. For example, it is an important ingredient
for developing realistic text-to-speech synthesizers. Also, it is a helpful tool to
describe the historical evolution of the rhythm of a natural language, as the 
only available evidence is that which can be retrieved from written texts. 

Stochastic chains with memory of variable length appear as good candidates 
to model the symbolic chains obtained by encoding written texts in natural 
languages. In effect, it can be argued on linguistic grounds that in a rhythmic 
chain each new symbol is a probabilistic function of a suffix (ending string) 
of the string of past symbols. Moreover, the length of the relevant portion of 
the past depends on the past itself. This corresponds precisely to the class of 
probabilistic context tree models introduced by Rissanen in his seminal 1983 
paper A universal data compression system in which the relevant part of the 
past is called a context. 

Given a finite realization of a stochastic chain with memory of variable 
length, the basic statistical question is how to identify the smallest probabilis- 
tic context tree fitting the data. This issue has been addressed by an increasing 
number of papers, starting with Rissanen (1983) who introduced the so-called 
algorithm Context to perform this task. Several variants of the algorithm
Context have been presented in the literature. An incomplete list includes Ron
et al. (1996), Bühlmann and Wyner (1999) and Galves et al. (2008). For a
survey of the results on the algorithm Context we refer the reader to Galves
and Löcherbach (2008).

A different approach was proposed by Csiszár and Talata (2006) who showed
that context trees can be consistently estimated in linear time using the
Bayesian Information Criterion (BIC). We refer the reader to this paper for a
nice description of other approaches and results in this field, including the
Context Tree Weighting Method (CTW) introduced by Willems et al. (1995). We
also refer the reader to Garivier (2006a, b) for recent and elegant results on
the BIC and the Context Tree Weighting Method.

Both the algorithm Context and the BIC procedure require the specification
of some constants. For the algorithm Context, the constant appears in
the threshold used in the pruning decision. For the BIC, the constant appears
in the penalization term. In both cases, the consistency of the algorithm does
not depend on the specific choice of the constant. However, for finite samples,
even of very large size, the choice of the constant does matter. Different
constants will give different answers ranging from the maximum tree (constant
close to zero) to the root tree (constant very large).

An adaptive procedure to choose the asymptotic context tree from a finite 
sample is a most important question from the point of view of applied statistics. 
This is achieved by the smallest maximizer criterion introduced in the present
paper. In informal terms, the criterion selects the tree which is the infimum of
a subset of the set of "champion trees".

We rigorously prove that the smallest maximizer criterion selects in a con- 
sistent way the finite context tree generating the infinite sample. 

Now the question is how to apply this criterion to identify the tree from a 
finite sample. We propose a new algorithm based on resampling to implement 
the criterion. We carry out a simulation study which indicates the suitability
of the procedure.

We apply the smallest maximizer criterion and its implementation to solve 
a long standing linguistic problem. Can we retrieve rhythmic features from 
written texts? 

Modern Portuguese provides an interesting case to be analyzed from the 
point of view of rhythm. European Portuguese and Brazilian Portuguese (hence- 
forth EP and BP respectively) share the same lexicon. From the point of view 
of external language, they also produce a great number of superficially iden- 
tical sentences (for the dichotomy internal and external language we refer the 
interested reader to Chomsky 1985). However EP and BP have been argued 
to implement different rhythms (cf. for instance Revah 1958 and Sandalo et al. 
2006). 

To verify this conjecture, the smallest maximizer criterion was applied to a 
real linguistic data set, constituted for the needs of the present study, consisting 
of randomly chosen written texts extracted from a corpus of Brazilian and 
European Portuguese daily newspapers. These texts were encoded using a finite 
set of labels, expressing a few basic rhythmic features which can be retrieved 
automatically from written texts. 

The smallest maximizer criterion selects different context trees for BP and 
EP. The difference between the context trees can be linguistically interpreted 
in a way which is compatible with current hypotheses on the characteristic 
features of the different rhythmic classes. 

This article is organized as follows. Section 2 presents the class of
probabilistic context tree models and states the main theoretical results
supporting the proposed algorithm. Section 3 presents the smallest maximizer
criterion (SMC) and its implementation is given in Section 4. Section 5 is
dedicated to the linguistic case study which is the original motivation for
this article. In Section 6 a simulation study illustrates in a concrete way the
good performance of the algorithm implementing the smallest maximizer
criterion. A final discussion is presented in Section 7. The mathematical
proofs of the theorems are given in Appendix 1. Appendix 2 presents, in a more
detailed way, the sample of symbolic chains, including a discussion of the
encoding procedure and the preprocessing of the linguistic data.



2 Stochastic chains with memory of variable length

Stochastic chains with memory of variable length, also called probabilistic con- 
text tree models, have the property that, for each string of past symbols, only 
a finite suffix (ending string) of the past is enough to predict the next symbol. 
Following Rissanen (1983) in which these models were introduced, let us call 
context this relevant part of the past. 

These models are characterized by the set of all contexts and an associated 
family of transition probabilities. Given a context, its associated transition 
probability gives the distribution of occurrence of the next symbol immediately 
after the context. 

The length of a context is a stopping time of the reversed chain. This 
means that to know if a context has length k, we only need to inspect the last 
k symbols of the string. In other terms, if we travel back in the string of past 
symbols, we can determine the border of the context without any knowledge of 
the symbols which are behind the border. 

Let us translate this in more formal terms. Let $A$ be a finite alphabet.
We will use the shorthand notation $w_m^n$ to denote the string
$(w_m, \dots, w_n)$ of symbols in the alphabet $A$. The length of this string
will be denoted by $\ell(w_m^n) = n - m + 1$. We say that a sequence
$s_{-j}^{-1}$ is a suffix of a sequence $w_{-k}^{-1}$ if $j \le k$ and
$s_{-i} = w_{-i}$ for all $i = 1, \dots, j$. This will be denoted as
$s_{-j}^{-1} \preceq w_{-k}^{-1}$. If $j < k$ then we say that $s$ is a proper
suffix of $w$ and denote this relation by $s \prec w$. The same definition
applies when $w_{-\infty}^{-1}$ is a semi-infinite sequence.

Definition 2.1. A finite subset $\tau$ of
$\bigcup_{k=1}^{\infty} A^{\{-k, \dots, -1\}}$ is an irreducible tree if
it satisfies the following conditions.

1. Suffix property. For no $w_{-k}^{-1} \in \tau$ we have
   $w_{-k+j}^{-1} \in \tau$ for $j = 1, \dots, k - 1$.

2. Irreducibility. No string belonging to $\tau$ can be replaced by a proper
   suffix without violating the suffix property.

It is easy to see that the set $\tau$ can be identified with the set of leaves
of a rooted tree with a finite set of labeled branches. Elements of $\tau$ will
be denoted either as $w$ or as $w_{-k}^{-1}$ if we want to stress the number of
elements of the string.
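
As an illustration of Definition 2.1, the following minimal sketch checks the suffix property and irreducibility for a small set of strings. It assumes a representation that is not part of the paper: each string is a Python `str` whose last character is the most recent symbol, and the function names are hypothetical.

```python
def is_suffix(s, w):
    """True if s is a suffix of w; the last character is the most recent symbol."""
    return w.endswith(s)


def has_suffix_property(tree):
    """True if no element of the tree is a proper suffix of another element."""
    return not any(s != w and is_suffix(s, w) for s in tree for w in tree)


def is_irreducible(tree):
    """Suffix property holds and no string can be shortened to a proper
    (non-empty) suffix without breaking the suffix property."""
    if not has_suffix_property(tree):
        return False
    for w in tree:
        for k in range(1, len(w)):
            shortened = (set(tree) - {w}) | {w[k:]}
            if has_suffix_property(shortened):
                return False
    return True


print(is_irreducible({"0", "01", "11"}))  # True
print(is_irreducible({"0", "11"}))        # False: "11" can be replaced by "1"
```

In the second example the string "11" can be replaced by its proper suffix "1" while keeping the suffix property, so the set is not irreducible.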

Let $p = \{p(\cdot \mid w) : w \in \tau\}$ be a family of probability measures
on $A$ indexed by the elements of $\tau$. The elements of $\tau$ will be called
contexts and the pair $(\tau, p)$ will be called probabilistic context tree.
The number of contexts in $\tau$ will be denoted by $|\tau|$. The height
$\ell(\tau)$ of the tree $\tau$ is the maximal length of a context in $\tau$,
that is

$$\ell(\tau) = \max\{\ell(w) : w \in \tau\}.$$

We recall that we are assuming that $\tau$ is a finite set and therefore
$\ell(\tau)$ is finite.

Definition 2.2. The stationary ergodic stochastic process $(X_t)$ on $A$ has
memory of variable length compatible with the probabilistic context tree
$(\tau, p)$ if

1. For any $n \ge \ell(\tau)$ and any sequence $x_{-n}^{-1}$,

   $$\mathbb{P}(X_0 = a \mid X_{-n}^{-1} = x_{-n}^{-1}) = p(a \mid x_{-j}^{-1}) \quad \text{for all } a \in A, \tag{2.3}$$

   where $x_{-j}^{-1}$ is the only suffix of $x_{-n}^{-1}$ belonging to $\tau$.

2. No proper suffix of $x_{-j}^{-1}$ satisfies (2.3).
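
To make Definition 2.2 concrete, here is a minimal sketch that looks up the context of a given past and simulates a chain from a probabilistic context tree. The toy model over the alphabet {0, 1}, its probabilities and the function names are assumptions chosen for illustration only; strings are written with the most recent symbol last.

```python
import random


def context_of(past, tree):
    """Return the unique suffix of `past` that belongs to the tree."""
    for k in range(1, len(past) + 1):
        if past[-k:] in tree:
            return past[-k:]
    raise ValueError("past is too short to determine a context")


def simulate(model, start, n, seed=0):
    """Generate n symbols; `model` maps each context to a dict symbol -> probability."""
    rng = random.Random(seed)
    chain = start
    for _ in range(n):
        w = context_of(chain, model)
        symbols, weights = zip(*sorted(model[w].items()))
        chain += rng.choices(symbols, weights=weights)[0]
    return chain[len(start):]


# Hypothetical toy model: after a 0 one symbol of memory is enough,
# after a 1 two symbols are needed.
toy_model = {
    "0": {"0": 0.5, "1": 0.5},
    "01": {"0": 0.9, "1": 0.1},
    "11": {"0": 0.2, "1": 0.8},
}

print(context_of("0101", toy_model))  # 01
print(simulate(toy_model, "00", 20))
```

The starting string only serves to initialize the past; it must be long enough for a context to be identified.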

Definition 2.4. Define the following partial ordering on the set of all context
trees. We will say that $\tau \preceq \tau'$ if for every $v \in \tau'$ there
exists $w \in \tau$ such that $w \preceq v$. As usual, whenever
$\tau \preceq \tau'$ with $\tau \ne \tau'$ we will write $\tau \prec \tau'$.
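
The ordering of Definition 2.4 can be checked mechanically. The sketch below assumes trees are given as sets of Python strings with the most recent symbol last; the name `tree_leq` and the example trees are hypothetical.

```python
def tree_leq(tau, tau_prime):
    """True if every element of tau_prime has a suffix belonging to tau."""
    return all(any(v.endswith(w) for w in tau) for v in tau_prime)


small = {"0", "1"}
big = {"0", "01", "11"}
other = {"00", "10", "1"}

print(tree_leq(small, big))                         # True
print(tree_leq(big, small))                         # False
print(tree_leq(big, other), tree_leq(other, big))   # False False
```

The last line shows two trees that are not comparable, which is why the ordering is only partial on the set of all context trees.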



3 Smallest maximizer criterion 

Given a finite sample $X_1, \dots, X_n$ of elements in $A$ generated by
$(\tau^*, p^*)$, the model selection problem is to find a procedure based on
$X_1^n$ to estimate the tree $\tau^*$.

For any finite string $w_{-j}^{-1}$ with $j \le d(n)$, we denote by
$N_n(w_{-j}^{-1})$ the number of occurrences of the string $w_{-j}^{-1}$ in the
sample

$$N_n(w_{-j}^{-1}) = \sum_{t=d(n)+1}^{n} \mathbf{1}\{X_{t-j}^{t-1} = w_{-j}^{-1}\}, \tag{3.1}$$

where $d(n)$ is a suitable function of $n$ such that $d(n) \to \infty$ as
$n \to \infty$.

Assuming the sample was generated by a stationary chain, for any finite
string $w_{-k}^{-1}$ such that $\sum_{b \in A} N_n(w_{-k}^{-1} b) > 0$, the
maximum likelihood estimator of the transition probability
$\mathbb{P}(X_0 = a \mid X_{-k}^{-1} = w_{-k}^{-1})$ is given by

$$\hat{p}_n(a \mid w_{-k}^{-1}) = \frac{N_n(w_{-k}^{-1} a)}{\sum_{b \in A} N_n(w_{-k}^{-1} b)}, \tag{3.2}$$

where $w_{-k}^{-1} a$ denotes the string $(w_{-k}, \dots, w_{-1}, a)$, obtained
by concatenating $w_{-k}^{-1}$ and the symbol $a$.
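
The counts and the estimator above can be sketched in a few lines. This is only an illustration under simplifying assumptions: the sample is a Python string, `d` plays the role of $d(n)$, and the index range is simplified so that every position with at least `d` past symbols is used; the function names are hypothetical.

```python
from collections import Counter


def transition_counts(sample, d):
    """counts[(w, a)]: number of times the string w (1 <= len(w) <= d) is
    immediately followed by the symbol a, over positions with d past symbols."""
    counts = Counter()
    for t in range(d, len(sample)):
        for k in range(1, d + 1):
            counts[(sample[t - k:t], sample[t])] += 1
    return counts


def mle(counts, w, a, alphabet):
    """Relative frequency of a among the symbols observed right after w."""
    total = sum(counts[(w, b)] for b in alphabet)
    if total == 0:
        raise ValueError("the string w is never followed by a symbol")
    return counts[(w, a)] / total


sample = "0100101001"
counts = transition_counts(sample, d=2)
print(mle(counts, "0", "1", "01"))  # 0.6
print(mle(counts, "1", "0", "01"))  # 1.0
```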

The likelihood function for a tree $\tau$ is given by

$$L_\tau(X_1^n) = \prod_{w \in \tau} \prod_{a \in A} \hat{p}_n(a \mid w)^{N_n(wa)}. \tag{3.3}$$

Let $\mathcal{T}_n = \mathcal{T}(X_1, \dots, X_n)$ be the set of all irreducible
trees $\tau$ such that

• $\ell(\tau) \le d(n)$;

• for all $w \in \tau$, $\sum_{b \in A} N_n(wb) > 0$;

• any sequence $u$ with $\sum_{b \in A} N_n(ub) > 0$ has a suffix that belongs
  to $\tau$ or is a suffix (proper or not) of an element in $\tau$.
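
The log-likelihood of a candidate tree can be computed directly from counts. The sketch below assumes the usual plug-in form, in which each context contributes the count of each following symbol times the logarithm of its relative frequency; the sample is a Python string, every context has length at most `d`, and the function name is hypothetical.

```python
import math
from collections import Counter


def log_likelihood(sample, tree, d):
    """Plug-in log-likelihood of `tree`; only positions with d past symbols are used."""
    counts = Counter()
    for t in range(d, len(sample)):
        window = sample[t - d:t]
        for w in tree:
            if window.endswith(w):
                counts[(w, sample[t])] += 1
    totals = Counter()
    for (w, _), c in counts.items():
        totals[w] += c
    return sum(c * math.log(c / totals[w]) for (w, _), c in counts.items())


sample = "0100101001"
ll_small = log_likelihood(sample, {"0", "1"}, d=2)
ll_big = log_likelihood(sample, {"0", "01", "11"}, d=2)
print(ll_small, ll_big)
```

Refining a context can never decrease this quantity when the same positions are counted, which is why the likelihood alone cannot be used to choose among nested trees.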

Let $\mathrm{df} : \mathcal{T}_n \to \mathbb{N}$ be a function that assigns to
each tree $\tau \in \mathcal{T}_n$ the number of degrees of freedom of the
model corresponding to the context tree $\tau$. The definition of
$\mathrm{df}(\tau)$ depends on the class of models considered. Without any
restriction $\mathrm{df}(\tau) = (|A| - 1)|\tau|$. However, in many scientific
data sets we know beforehand that some transitions are not allowed by the
nature of the problem. That is the case of the linguistic data set we are
considering in our case study presented in Section 5. In general, we can define
an incidence function
$\chi : \bigcup_{j=1}^{\infty} A^{\{-j, \dots, -1, 0\}} \to \{0, 1\}$ which
indicates in a consistent way which are the possible transitions. By
consistent we mean that if $\chi(w_{-j}^{-1} a) = 0$ for some $w_{-j}^{-1}$
and $a \in A$ then $\chi(w_{-k}^{-1} a) = 0$ for all $k > j$. In this case,

$$\mathrm{df}(\tau; \chi) = \sum_{w \in \tau} \sum_{a \in A} \chi(wa).$$

Obviously, we are using the convention that $\chi(wa) = 0$ means that the
transition from $w$ to $a$ is not allowed.
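Then

$$\mathcal{T}_n = \bigcup_{g \in \mathcal{G}_n} \mathcal{T}_n^{(g)},$$

where $\mathcal{G}_n = \mathrm{df}(\mathcal{T}_n)$ and
$\mathcal{T}_n^{(g)} = \{\tau \in \mathcal{T}_n : \mathrm{df}(\tau) = g\}$.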

For each $g \in \mathcal{G}_n$ let $\tau_n^{(g)}$ be the tree belonging to the
class $\mathcal{T}_n^{(g)}$ which maximizes the likelihood of the sample, that
is

$$\tau_n^{(g)} = \arg\max_{\tau \in \mathcal{T}_n^{(g)}} \log L_\tau(X_1^n).$$

Denote by $\mathcal{C}_n$ the class of champion trees belonging to
$\mathcal{T}_n$, that is

$$\mathcal{C}_n = \{\tau_n^{(g)} : g \in \mathcal{G}_n \text{ such that } L_{\tau_n^{(g')}}(X_1^n) < L_{\tau_n^{(g)}}(X_1^n) \text{ whenever } g' < g\}.$$

Observe that it is possible to have $g' < g$ with
$L_{\tau_n^{(g')}}(X_1^n) > L_{\tau_n^{(g)}}(X_1^n)$. In the definition of
$\mathcal{C}_n$ we discard the bigger tree since the tree with fewer
parameters provides larger likelihood.
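
The selection of champions from the per-dimension maximizers can be sketched as a single pass over the degrees of freedom in increasing order. The input dictionary, mapping each number of degrees of freedom to the best log-likelihood found for it, holds made-up numbers, and the function name is hypothetical.

```python
def champion_dimensions(best_loglik):
    """Keep g only if every smaller g' has a strictly smaller log-likelihood."""
    kept, running_max = [], float("-inf")
    for g in sorted(best_loglik):
        if best_loglik[g] > running_max:
            kept.append(g)
        running_max = max(running_max, best_loglik[g])
    return kept


# Hypothetical values: the model with 12 degrees of freedom is beaten by a smaller one.
best_loglik = {4: -100.0, 8: -90.0, 12: -95.0, 16: -80.0}
print(champion_dimensions(best_loglik))  # [4, 8, 16]
```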

Define also the class $\mathcal{C}$ of all champion trees for the infinite
sample, that is

$$\mathcal{C} = \bigcup_{n \ge 1} \mathcal{C}_n.$$

Observe that the set of all context trees is not totally ordered with respect
to the ordering introduced in Definition 2.4. It turns out that, for any $n$,
the set of champion trees $\mathcal{C}_n$ is totally ordered and contains the
tree generating the sample for sufficiently large sample sizes. This is the
basis for the selection principle and is the content of the next theorem.

Theorem 3.4. Assume $X_1, \dots, X_n$ is a sample of an ergodic stochastic
process compatible with $(\tau^*, p^*)$, with $\tau^*$ finite. Then,
$\mathcal{C}_n$ is totally ordered with respect to the order $\prec$ and
eventually almost surely $\tau^* \in \mathcal{C}_n$ as $n \to \infty$.

The next theorem is the basis for the smallest maximizer criterion. It shows
that there is a change of regime in the gain of likelihood at $\tau^*$.

Theorem 3.5. Assume $X_1, \dots, X_n$ is a sample of an ergodic stochastic
process compatible with $(\tau^*, p^*)$ with $\tau^*$ finite. Then, the
following results hold eventually almost surely as $n \to \infty$.

(1) For any $\tau \in \mathcal{C}_n$, with $\tau \prec \tau^*$, there exists a
    constant $c(\tau^*, \tau) > 0$ such that

    $$\log L_{\tau^*}(X_1^n) - \log L_\tau(X_1^n) \ge c(\tau^*, \tau)\, n.$$

(2) For any $\tau \prec \tau' \in \mathcal{C}_n$, with $\tau^* \preceq \tau$,
    there exists a constant $c(\tau, \tau') \ge 0$ such that

    $$\log L_{\tau'}(X_1^n) - \log L_\tau(X_1^n) \le c(\tau, \tau') \log n.$$
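
A quick numerical reading of the two regimes: after dividing by $n$, a gain that grows linearly stays at least as large as a positive constant, while a gain that grows at most logarithmically is squeezed to zero. The sketch below simply tabulates both bounds; the constant 1.0 and the sample sizes are arbitrary values chosen for illustration.

```python
import math


def per_symbol_bounds(c, sizes):
    """For each n: (n, linear bound c*n/n, logarithmic bound c*log(n)/n)."""
    return [(n, c * n / n, c * math.log(n) / n) for n in sizes]


bounds = per_symbol_bounds(1.0, [10_000, 40_000, 70_000])
for n, linear, logarithmic in bounds:
    print(n, linear, round(logarithmic, 6))
```

The second column stays constant while the third one decreases with the sample size, which is the contrast that the selection criterion exploits.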

Theorems 3.4 and 3.5 lead to the following Smallest Maximizer Criterion.

Smallest Maximizer Criterion. Select the smallest tree $\hat{\tau}$ in the set
of champion trees $\mathcal{C}$ such that

$$\lim_{n \to \infty} \frac{\log L_\tau(X_1^n) - \log L_{\hat{\tau}}(X_1^n)}{n} = 0$$

for any $\tau \succeq \hat{\tau}$.

The next theorem states the consistency of this criterion. 

Theorem 3.6. Let $X_1, X_2, \dots$ be an ergodic chain compatible with the
probabilistic context tree $(\tau^*, p^*)$ with $\tau^*$ finite. Then,

$$\mathbb{P}(\hat{\tau} \ne \tau^*) = 0.$$

To avoid technical details and facilitate the reading, we delay the proofs of 
Theorems 3.4, 3.5 and 3.6 to Appendix 1.

The problem now is how to identify this smallest tree. A procedure doing 
this is presented in the next section. 
4 Implementing the smallest maximizer criterion



In order to select the model, we first need an algorithm to compute the set of
champion trees $\mathcal{C}_n \subset \mathcal{T}_n$. To do this we explore the
relationship between our criterion and the BIC context tree selection.

Definition 4.1. The BIC context tree estimator with penalizing constant $c > 0$
is defined as

$$\hat{\tau}_{BIC}(X_1^n; c) = \arg\max_{\tau \in \mathcal{T}_n} \{\log L_\tau(X_1^n) - c \cdot \mathrm{df}(\tau) \cdot \log n\}, \tag{4.2}$$

where $L_\tau(X_1^n)$ is the likelihood of the tree $\tau$ given the sample and
$\mathrm{df}(\tau)$ denotes the number of degrees of freedom of the model
corresponding to the context tree $\tau$.

Proposition 4.3. The set of champion trees $\mathcal{C}_n$ is the image of the
map

$$c \in [0, +\infty) \mapsto \hat{\tau}_{BIC}(X_1^n; c) \in \mathcal{T}_n.$$

Remark: Csiszár and Talata (2006) prove the consistency of the BIC selection
procedure in the case of unbounded trees when $d(n) = o(\log n)$. Besides the
consistency of the procedure, this condition also implies that the estimation
can be done in linear time using the context tree maximizing (CTM) algorithm
introduced by Willems et al. (1995). Assuming that the tree is bounded,
Garivier (2006a) proves consistency of the BIC selection procedure for any
diverging function $d(n)$. This is the case we consider here. Therefore, the
above proposition implies that all champion trees in $\mathcal{C}_n$ can be
obtained using the CTM algorithm by changing the penalizing constant in the BIC.
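
The idea of sweeping the penalizing constant can be sketched on a small list of candidates. This is not the CTM algorithm: it assumes that a list of (degrees of freedom, log-likelihood) pairs is already available, uses made-up numbers and a coarse grid of constants, and only records which candidates are selected as the penalty decreases. All names are hypothetical.

```python
import math


def bic_choice(candidates, n, c):
    """Degrees of freedom of the candidate maximizing log L - c * df * log n."""
    return max(candidates, key=lambda pair: pair[1] - c * pair[0] * math.log(n))[0]


def bic_path(candidates, n, constants):
    """Distinct selections obtained while the constant goes from large to small."""
    path = []
    for c in sorted(constants, reverse=True):
        g = bic_choice(candidates, n, c)
        if g not in path:
            path.append(g)
    return path


# Hypothetical (degrees of freedom, log-likelihood) pairs.
candidates = [(4, -100.0), (8, -90.0), (12, -95.0), (16, -80.0)]
print(bic_path(candidates, n=1000, constants=[0.0, 0.1, 0.2, 0.5, 1.0]))  # [4, 8, 16]
```

A large constant selects the smallest candidate and a constant close to zero selects the largest one; a finer grid of constants may be needed in practice so that no intermediate selection is skipped.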

The next step is to identify a tree $\hat{\tau}$ belonging to $\mathcal{C}_n$
for $n$ sufficiently large but finite. Theorem 3.4 guarantees that
$\tau^* \in \mathcal{C}_n$. In this case we have to choose, among the champion
trees belonging to $\mathcal{C}_n$, the smallest one for which the gain in
likelihood is negligible when compared to bigger ones.

To do this, we propose a bootstrap procedure. To determine the change of
regime we compare the bootstrap confidence intervals. In practice, we compare
the ratio between the gain in log-likelihood and the size of the sample with the
boxplots of the resamples. We expect that for $\tau_n^{(g)} \succeq \tau^*$ the
confidence intervals constructed with the increasing sample sizes will
decrease, whereas for $\tau_n^{(g)} \prec \tau^*$ the confidence intervals will
either converge to a point or increase.
Bootstrap Procedure:

1. For different sample sizes $n_1 < n_2 < \dots < n_R < n$ suitably chosen
   obtain $B$ independent bootstrap resamples of $X_1, \dots, X_n$. Denote
   these resamples by
   $X^{*,(b,j)} = \{X_i^{*,(b,j)}, i = 1, \dots, n_j\}$ for $b = 1, \dots, B$
   and $j = 1, \dots, R$.

2. For $j = 1, 2, \dots, R$ and for all $\tau_n^{(g)} \in \mathcal{C}_n$ and
   its successor $\tau_n^{(g')} \in \mathcal{C}_n$ in the $\prec$ order,
   compute the 1st and 3rd quartile for the ratio

   $$\frac{\log L_{\tau_n^{(g)}}(X^{*,(b,j)}) - \log L_{\tau_n^{(g')}}(X^{*,(b,j)})}{n_j}.$$

   Denote them by $Q_1^{g,j}$ and $Q_3^{g,j}$ respectively.

3. Select the tree $\hat{\tau}$ as the first champion tree $\tau_n^{(g)}$ such
   that the resampled confidence interval $[Q_1^{g,j}, Q_3^{g,j}]$ shrinks to
   zero as $j$ increases.

In Step 1 above, any bootstrap resampling method for stochastic chains 
with memory of variable length can be used. In our specific case, we use a 
remarkable feature for our data set, that is, the fact that one of the symbols is 
a renewal point. This makes it possible to sample randomly with replacement 
independent strings between two successive renewal points. 
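
Steps 2 and 3 of the procedure can be sketched once the bootstrap ratios are available. The code below computes the quartile interval for each sample size and then applies a crude numerical stand-in for the visual inspection of the boxplots: an interval is said to shrink if its largest absolute endpoint strictly decreases with the sample size. This rule, the function names and the numbers are assumptions made for illustration, not the rule used in the paper.

```python
import statistics


def quartile_interval(ratios):
    """First and third quartiles of a list of bootstrap ratios."""
    q1, _, q3 = statistics.quantiles(ratios, n=4)
    return q1, q3


def shrinks_to_zero(intervals):
    """Proxy rule: the largest absolute endpoint strictly decreases with j."""
    sizes = [max(abs(q1), abs(q3)) for q1, q3 in intervals]
    return all(later < earlier for earlier, later in zip(sizes, sizes[1:]))


def select_first_shrinking(ratios_by_tree):
    """ratios_by_tree[i][j]: bootstrap ratios for the i-th champion tree (ordered
    from the smallest) at the j-th sample size. Returns the index of the first
    tree whose intervals shrink, or None."""
    for index, per_size in enumerate(ratios_by_tree):
        if shrinks_to_zero([quartile_interval(r) for r in per_size]):
            return index
    return None


# Hypothetical ratios: stable for the first tree, shrinking for the second.
base = [-0.020, -0.010, -0.015, -0.012, -0.018]
stable = [[-0.06, -0.05, -0.04, -0.05, -0.055]] * 3
shrinking = [base, [x / 2 for x in base], [x / 4 for x in base]]
print(select_first_shrinking([stable, shrinking]))  # 1
```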

5 A linguistic case study 

It has been conjectured in the linguistic literature that languages are divided 
into different rhythmic classes (Lloyd 1940, Pike 1945, Abercrombie 1967, 
among others). In particular it has been argued that European Portuguese 
and Brazilian Portuguese belong to different rhythmic classes (cf. for instance 
Sandalo et al. 2006, and Frota and Vigário 2001 for a critical discussion of the
rhythmic features of BP and EP). We refer the reader to Ramus (2002) for an 
illuminating discussion of the rhythmic class conjecture. 

For half a century, neither a precise definition of each class nor any
reliable phonetic evidence of the existence of the classes was presented in the
linguistic literature. The situation started changing at the end of the century.
First of all, Mehler et al. (1996) gave empirical evidence that newborn babies
are able to discriminate rhythmic classes. Then Ramus, Nespor and Mehler
(1999) gave for the first time evidence that simple statistics of the speech
signal could discriminate between different rhythmic classes. A sound statistical
basis to this descriptive analysis was given in Cuesta et al. (2007) who used 
the projected Kolmogorov-Smirnov test to classify the sonority paths of the 
sentences in the sample analyzed in Ramus, Nespor and Mehler (1999). 
The above mentioned papers were all based on the study of speech data.
However, for the purposes of historical linguistics the only available data are
written texts. This is precisely the challenge faced by the Tycho Brahe project
(www.tycho.iel.unicamp.br) which aims to study how the rhythmic features
of Portuguese changed between the 16th and the 19th century.

It has been claimed by many classical phonologists that the rhythmic
features of Portuguese suffered a major modification in Portugal somewhere
between the 17th and the 18th centuries. As a consequence, from the point of
view of rhythm the 16th century Portuguese would be closer to modern
Brazilian Portuguese than to modern European Portuguese (cf. Revah 1954 and
Teyssier 1980 among others). Therefore it was natural to start with a study of 
the rhythmic properties of written texts of BP and EP in an attempt to find a 
point of comparison with the results of a forthcoming study of the texts of the 
Tycho Brahe Corpus of Historical Portuguese. 

The modern data we analyze is an encoded corpus of newspaper articles.
The electronic files with these articles are available through the project
AC/DC (Acesso a Corpora/Disponibilização de Corpora) at the URL
www.linguateca.pt/acesso, corpus CHAVE (see Santos and Rocha 2005 for a
presentation of the corpus). This corpus contains all the 365 editions of the
years 1994 and 1995 from the daily newspapers Folha de São Paulo (Brazil) and
O Público (Portugal). Our sample consists of 80 articles randomly selected
from the 1994 and 1995 editions. We chose 20 articles from each year for each
newspaper. We ended up with a sample of 97,750 symbols for Brazilian
Portuguese (BP) and 105,326 symbols for European Portuguese (EP).

Encoding was made by assigning one of four symbols to each syllable of the
text according to whether: (i) it is stressed or not; (ii) it is the beginning
of a prosodic word or not. By prosodic word we mean a lexical word together
with the functional non stressed words which precede it (cf. for instance
Vigário 2003). Using the base 2 representation of the integers, this double
Boolean classification can be represented by the four symbols alphabet
{0, 1, 2, 3} where

• 0 = non-stressed, non prosodic word initial syllable;

• 1 = stressed, non prosodic word initial syllable; 

• 2 = non-stressed, prosodic word initial syllable; 

• 3 = stressed, prosodic word initial syllable. 

Additionally we assigned an extra symbol (4) to encode the end of each 
sentence. Let us call A = {0, 1, 2, 3, 4} the alphabet obtained in this way. 

An example will help in understanding the encoding. The sentence O menino
já comeu o doce (The boy already ate the candy) starts with the prosodic word
O menino and the sentence is encoded as

Sentence  O   me  ni  no  já  co  meu  o   do  ce  .
Code      2   0   1   0   3   2   1    2   1   0   4
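
The encoding rule can be written in a couple of lines, since the symbol is the base 2 number formed by the two Boolean features. The sketch below assumes the syllables have already been annotated with the pair (stressed, prosodic word initial); the function names and the annotation of the example sentence as a Python list are illustrative assumptions, not the Perl software mentioned in the text.

```python
def encode_syllable(stressed, word_initial):
    """Symbol in {0, 1, 2, 3}: 2 * (prosodic word initial) + (stressed)."""
    return str(2 * int(word_initial) + int(stressed))


def encode_sentence(syllables, end_symbol="4"):
    """Encode a list of (stressed, word_initial) pairs and append the end symbol."""
    return "".join(encode_syllable(s, i) for s, i in syllables) + end_symbol


# O me-ni-no | já | co-meu | o do-ce, annotated as (stressed, prosodic word initial)
sentence = [
    (False, True), (False, False), (True, False), (False, False),  # O me ni no
    (True, True),                                                  # já
    (False, True), (True, False),                                  # co meu
    (False, True), (True, False), (False, False),                  # o do ce
]
print(encode_sentence(sentence))  # 20103212104
```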

This encoding can be performed automatically after a preprocessing of
the texts (see Appendix 2). Software written in Perl was developed for
this purpose and it is available upon request. The corpora with the encoded
newspaper texts can be freely downloaded for academic purposes at the URL
www.ime.usp.br/~tycho/smc/data.

This way of encoding written texts according to their rhythmic properties is
new, and the data set we are considering has never been analysed from this
point of view before.

It is worth observing that the symbolic chain obtained this way is con- 
strained both by general restrictions on possible sentences and by the morphol- 
ogy of Portuguese. Therefore several transitions are impossible. For instance, 
the symbol 4 which encodes the separation between sentences can only be fol- 
lowed by the symbols 2 or 3 which encode the beginning of prosodic words. 
These restrictions are the key to computing the number of degrees of freedom 
of the proposed model. The full set of restrictions is described in Appendix 2. 

To implement the smallest maximizer criterion as described in Section 4 we
first need to identify the set of champion trees. By Proposition 4.3 this can
be done by changing the penalization constant in the BIC for context trees. As
proved in Csiszár and Talata (2006), this can be done in linear time. This way
we obtain the set of champion trees for both languages, $\mathcal{C}_{BP}$ and
$\mathcal{C}_{EP}$, which ranges from the root tree (independent case), with
$|A| - 1$ degrees of freedom, to trees with several thousand degrees of
freedom. The function df, which assigns to each model its number of degrees of
freedom, was computed taking into account the constraints of the symbolic
chain mentioned in the last paragraph. Figure 1 presents the log-likelihood
corresponding to each champion tree for BP and EP according to the number of
leaves. The figure clearly suggests that there is a change of regime in a
certain region. However, a visual inspection is not enough to detect precisely
in which tree it takes place.

To implement the bootstrap procedure we choose $R = 3$, $n_1 = 10{,}000$,
$n_2 = 40{,}000$, $n_3 = 70{,}000$ and $B = 250$. To resample, we take
advantage of a striking feature which is present in all the champion trees.
The symbol 4 appears as a renewal point, that is,
$p^*(\cdot \mid w4u) = p^*(\cdot \mid 4u)$ for any finite sequences $w$ and
$u$. Therefore, we use the independent blocks between two consecutive
occurrences of the symbol 4 to perform the usual Efron bootstrap, sampling
blocks independently with replacement. The final resample of size $n_j$ is
obtained by the concatenation of the successively chosen independent blocks
truncated at size $n_j$.
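
The resampling step described above can be sketched as follows, assuming the encoded chain is a Python string in which the symbol 4 closes each block. The function names, the seed handling and the short example chain are illustrative assumptions.

```python
import random


def split_blocks(chain, renewal="4"):
    """Cut the chain into blocks, each ending with the renewal symbol;
    a trailing incomplete block is discarded."""
    blocks, current = [], ""
    for symbol in chain:
        current += symbol
        if symbol == renewal:
            blocks.append(current)
            current = ""
    return blocks


def block_resample(chain, size, seed=0, renewal="4"):
    """Concatenate blocks drawn independently with replacement, truncated at `size`."""
    blocks = split_blocks(chain, renewal)
    if not blocks:
        raise ValueError("the chain contains no complete block")
    rng = random.Random(seed)
    resample = ""
    while len(resample) < size:
        resample += rng.choice(blocks)
    return resample[:size]


chain = "20104321042104"
print(split_blocks(chain))  # ['20104', '32104', '2104']
print(block_resample(chain, 12, seed=1))
```

Because of the truncation, the last block of a resample may be incomplete, which matches the description of a final resample of fixed size.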

Software written in C was developed to implement both the identification
of the champion trees and the bootstrap procedure. This software is available
upon request.

Figure 1: Log-likelihood of the sample as a function of the number of leaves
for Brazilian Portuguese (a) and European Portuguese (c). Figures (b) and
(d) are zoomed pictures from (a) and (c) for number of leaves bigger than 10.

According to the smallest maximizer criterion we compared the ratio of the
log-likelihood and the sample size for the resamples. Figures 2 and 3 show
some of the corresponding boxplots as well as the observed value for the whole
sample (solid line). We can clearly see a change of regime in the tree with 15
leaves for BP and 18 leaves for EP, corresponding to the trees shown in
Figures 4 and 5.

Besides discriminating EP and BP, the selected trees have properties which 
are linguistically interpretable. First, in both trees, 4 is a context. This is ex- 
pected on linguistic grounds since both in syntax and in phonology, the sentence 
is the higher domain. 

Second, in both trees, non stressed internal syllables provide poor
information about the future. Three successive symbols zero are needed to
constitute a context. This is also a welcome result from a linguistic point of
view since non stressed non initial syllables do not play a salient role in
rhythm on their own, but only by constituting prosodic domains (feet) with
stressed syllables or by being aligned with higher prosodic domains (words or
phrases).

Figure 2: Resampling boxplots of the ratio of the log-likelihood and the sample
size for Brazilian Portuguese and observed sample value (solid line).

Note that stressed syllables are also not sufficient to predict the future. The
tables of transition probabilities (Figures 4 and 5) show that in both languages
the distribution of what follows a stressed syllable is dependent on the presence
or absence of a preceding prosodic word boundary in the two preceding steps.
This fact, arguably derivable from the morphological patterns of words in
Portuguese and their frequency in use, does not discriminate EP and BP, which
is expected, since to a great extent they share the same lexicon.

Finally, according to the selected trees, the main difference between the two
languages is that whereas in BP, both 2 (unstressed boundary of a prosodic
word) and 3 (stressed boundary of a prosodic word) are contexts, in EP only
3 is. This means that in EP, but not in BP, the words which begin with a
stressed syllable behave differently from the words which don't. This is again a
welcome result since it is compatible with already observed differences involving
the prosodic properties of words in the two languages (cf. Vigário (2003) and
Sandalo et al. (2006), among others).

Figure 3: Resampling boxplots of the ratio of the log-likelihood and the sample
size for European Portuguese.

6 Simulation results 

We perform a simulation study using the context tree and the transition
probabilities presented in Figure 6. We simulate a sample with 100,000 symbols,
obtaining 1,882 phrases (sequences delimited by the symbol 4). Using this
sample, we estimate the sequence of champion trees by increasing the value of
the penalizing constant and considering only trees with height smaller than or
equal to 7. This procedure gives a sequence of trees containing the true tree
of 13 leaves. As an illustration of Theorem 3.5 we plot the log-likelihood
corresponding to each tree as a function of the number of leaves. We can see a
change of regime at the tree with 13 leaves, but its identification using this
graphical representation is difficult because the tree with 11 leaves has a
log-likelihood very close to the real one.

Figure 4: Probabilistic context tree for Brazilian Portuguese.

In order to identify the true tree we assess the convergence of the difference
in log-likelihood of two adjacent trees when divided by the sample size. For
each size $n_j = j \cdot 10{,}000$, with $j = 1, 2, \dots, 8$, we obtain 250
resamples, by sampling with replacement between the 1,882 phrases. For each
resample and for each pair of consecutive trees $(\tau_j, \tau_{j+1})$, we
compute the difference between the logarithm of the likelihoods and divide
this quantity by $n_j$. The corresponding boxplots for each value of $n_j$ and
for the trees with 8, 11, 13, 16 and 17 leaves are presented in Figure 8. Note
that we can clearly see a change of behavior in the boxplots when considering
trees bigger than or equal to the real tree.

7 Final discussion 

In this paper we introduce the smallest maximizer criterion to estimate the 
context tree of a chain with memory of variable length from a finite sample. 
The criterion selects a tree in the class of champion trees. This class
coincides with the subset of trees obtained by varying the penalizing constant
in the BIC criterion. For this reason, the smallest maximizer criterion
actually suggests a tuning procedure for the BIC context tree selection.
Therefore, the present paper can be interpreted as an effort to solve the most
important problem of constant-free model selection in the case of probabilistic
context tree models.

Figure 5: Probabilistic context trees for European Portuguese.

To our knowledge Bühlmann (2000) was the first to address the problem
of how to tune a context tree estimator, in the case of the algorithm Context. 
This paper proposes the following tuning procedure. First use the algorithm 
Context with different values of the threshold to obtain a sequence of candidate 
trees. For each one of these candidate trees estimate a global risk function, as 
for example the Final Prediction Error (FPE) or the Kullback-Leibler Informa- 
tion (KLI), by using a parametric bootstrap approach. Then choose as cut-off 
parameter the one providing the tree with smallest estimated risk. 

In the above mentioned paper there is no proof that the sequence of nested 
trees obtained by the pruning procedure using the algorithm Context will 
eventually almost surely contain the tree generating the sample, which in our 
case is given in Theorem 3.3. It also misses the crucial point of the change of 
regime in the set of champion trees, which is given in our Theorem 3.5. 

Figure 6: Context tree and transition probabilities over the alphabet 
$A = \{0,1,2,3,4\}$. 

The change of regime was not missed by the more recent paper of Dalevi and 
Dubhashi (2005). They extend to chains with memory of variable length the 
order estimator introduced in Peres and Shields (2005). They suggest, without 
any rigorous proof, that at the correct order there exists a sharp transition 
that can be identified from a finite sample. They then apply the criterion to 
the identification of sequence similarity in DNA. Our main contribution with 
respect to this paper is the rigorous proof of Theorem 3.6. 

Appendix 1 - Proof of the theorems 

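Proof of Proposition 3.3. Let us first show that for any $c > 0$, 
$\hat{\tau}_{\mathrm{BIC}}(X_1^n; c)$ belongs to $\mathcal{C}_n$. Recall that 
$\mathcal{T}_n = \bigcup_{g \in \mathcal{G}_n} \mathcal{T}_n^{(g)}$. Therefore, for any $c > 0$, 

\[
\begin{aligned}
\arg\max_{\tau \in \mathcal{T}_n} \{\log L_\tau(X_1^n) - c \cdot \mathrm{df}(\tau) \cdot \log n\}
&= \arg\max_{g \in \mathcal{G}_n} \max_{\tau \in \mathcal{T}_n^{(g)}} \{\log L_\tau(X_1^n) - c \cdot \mathrm{df}(\tau) \cdot \log n\} \\
&= \arg\max_{\tau_n^{(g)} :\, g \in \mathcal{G}_n} \{\log L_{\tau_n^{(g)}}(X_1^n) - c \cdot g \cdot \log n\}.
\end{aligned}
\]

Since $\mathcal{G}_n$ is finite, for each $c > 0$ the maximum in the above equation is 
reached. Since different champion trees have different likelihoods, there exists 
only one champion tree corresponding to each constant $c > 0$. 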

Now we have to prove that for any $\tau_n^{(g)} \in \mathcal{C}_n$, there exists a positive 
constant $c = c(g)$ such that $\tau_n^{(g)} = \hat{\tau}_{\mathrm{BIC}}(X_1^n; c)$. By definition, 
for any two champion trees $\tau_n^{(g')}$ and $\tau_n^{(g'')}$ belonging to $\mathcal{C}_n$, with 
$g' < g''$, we have 

\[ \log L_{\tau_n^{(g')}}(X_1^n) < \log L_{\tau_n^{(g'')}}(X_1^n). \]

Therefore, the rate 

\[ \frac{\log L_{\tau_n^{(g')}}(X_1^n) - \log L_{\tau_n^{(g'')}}(X_1^n)}{g' - g''} \]

is always positive. The result follows by choosing $c$ as 

\[ c = \min\left\{ \frac{\log L_{\tau_n^{(g')}}(X_1^n) - \log L_{\tau_n^{(g)}}(X_1^n)}{g' - g} ,\ g' \in \mathcal{G}_n \setminus \{g\} \right\}. \]

This concludes the proof of the proposition. 

Figure 8: Simulation results. 
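
The correspondence between champion trees and BIC constants can be illustrated numerically. The sketch below is only an illustration: the function name `bic_choice` and the four (degrees of freedom, log-likelihood) pairs are made up for the example and are not taken from the data analysed in this paper. Each champion tree is represented only by its number of degrees of freedom $g$ and its log-likelihood, and the BIC choice is the maximizer of $\log L - c \cdot g \cdot \log n$.

```python
import math


def bic_choice(champions, n, c):
    """Return the df of the champion maximizing log L - c * df * log n.

    `champions` is a list of (df, log_likelihood) pairs, one per champion tree.
    """
    best = max(champions, key=lambda t: t[1] - c * t[0] * math.log(n))
    return best[0]


# Hypothetical champion trees: the likelihood increases with the degrees of freedom.
champions = [(1, -700.0), (2, -650.0), (4, -640.0), (8, -638.0)]
n = 1000

# Increasing the constant moves the choice towards smaller champion trees.
choices = [bic_choice(champions, n, c) for c in (0.01, 0.1, 1.0, 10.0)]
print(choices)
```

In this toy example each of the four champions is selected by one of the four constants, and the selected number of degrees of freedom decreases as $c$ grows.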

The tools to prove Theorems 3.4 and 3.5 are borrowed from Csiszár and 
Talata (2006). 

Proof of Theorem 3.4. First recall that the BIC context tree estimator 
is strongly consistent for any constant $c > 0$. Therefore, since the set C is 
countable, it follows that eventually almost surely $\tau^* \in \mathcal{C}_n$ as $n \to \infty$. 

The fact that the champion trees are ordered by $\prec$ follows immediately from 
the following lemma and Proposition 4.3. 



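Lemma 8.1. Let $0 < c_1 < c_2$ be arbitrary positive constants. Then 

\[ \hat{\tau}_{\mathrm{BIC}}(X_1^n; c_1) \succeq \hat{\tau}_{\mathrm{BIC}}(X_1^n; c_2). \]

Proof. For a string $w$ with $\ell(w) \le d(n)$ define 

\[ L_w(X_1^n) = \prod_{a \in A} \hat{p}_n(a|w)^{N_n(wa)} \]

and $\mathrm{df}(w) = \sum_{a \in A} \chi(wa)$. Then, for any constant $c > 0$ define recursively the 
value 

\[
V_w^c(X_1^n) =
\begin{cases}
\max\left\{ n^{-c\,\mathrm{df}(w)} L_w(X_1^n),\ \prod_{a \in A} V_{aw}^c(X_1^n) \right\}, & \text{if } 0 \le \ell(w) < d(n), \\
n^{-c\,\mathrm{df}(w)} L_w(X_1^n), & \text{if } \ell(w) = d(n)
\end{cases}
\]

and the indicator 

\[
\delta_w^c(X_1^n) =
\begin{cases}
1, & \text{if } 0 \le \ell(w) < d(n) \text{ and } \prod_{a \in A} V_{aw}^c(X_1^n) > n^{-c\,\mathrm{df}(w)} L_w(X_1^n), \\
0, & \text{if } 0 \le \ell(w) < d(n) \text{ and } \prod_{a \in A} V_{aw}^c(X_1^n) \le n^{-c\,\mathrm{df}(w)} L_w(X_1^n), \\
0, & \text{if } \ell(w) = d(n).
\end{cases}
\]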



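Now, for any finite string $w$ with $\ell(w) \le d(n)$ and for any tree $\tau \in \mathcal{T}_n$, 
we define the irreducible tree $\tau_w$ as the set of branches in $\tau$ which have $w$ as a 
suffix, that is 

\[ \tau_w = \{u \in \tau : w \preceq u\}. \]

Let $\mathcal{T}_w(X_1^n)$ be the set of all trees defined in this way, that is 

\[ \mathcal{T}_w(X_1^n) = \{\tau_w : \tau \in \mathcal{T}_n\}. \]

If $w$ is a sequence such that $\delta_w^c(X_1^n) = 1$ we define the maximizing tree assigned 
to the sequence $w$ as the tree $\tau_w^c(X_1^n) \in \mathcal{T}_w(X_1^n)$ given by 

\[ \tau_w^c(X_1^n) = \{u : \delta_u^c(X_1^n) = 0,\ \delta_v^c(X_1^n) = 1 \text{ for all } w \preceq v \prec u\}. \]

If $w$ is a sequence such that $\delta_w^c(X_1^n) = 0$, we define the maximizing tree assigned 
to the sequence $w$ as the tree $\tau_w^c(X_1^n) \in \mathcal{T}_w(X_1^n)$ given by 

\[ \tau_w^c(X_1^n) = \{w\}. \]

Csiszár and Talata (2006) proved that 

\[ V_w^c(X_1^n) = \max_{\tau \in \mathcal{T}_w(X_1^n)} \prod_{u \in \tau} n^{-c\,\mathrm{df}(u)} L_u(X_1^n) = \prod_{u \in \tau_w^c(X_1^n)} n^{-c\,\mathrm{df}(u)} L_u(X_1^n). \tag{8.2} \]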

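Denote by $\tau^1 = \hat{\tau}_{\mathrm{BIC}}(X_1^n; c_1)$ and $\tau^2 = \hat{\tau}_{\mathrm{BIC}}(X_1^n; c_2)$. Suppose that it is not 
true that $\tau^1 \succeq \tau^2$. Then there exist sequences $w \in \tau^1$ and $w' \in \tau^2$ such that 
$w$ is a proper suffix of $w'$. This implies that $\tau_w^2 \neq \emptyset$. Since $\tau_w^2$ is irreducible we 
have that $|\tau_w^2| \ge 2$. Then, using the definition of maximizing tree we obtain 

\[
\begin{aligned}
\log L_w(X_1^n) &\ge \log L_{\tau_w^2}(X_1^n) + c_1 \Big( \mathrm{df}(w) - \sum_{w' \in \tau_w^2} \mathrm{df}(w') \Big) \log n \\
&> \log L_{\tau_w^2}(X_1^n) + c_2 \Big( \mathrm{df}(w) - \sum_{w' \in \tau_w^2} \mathrm{df}(w') \Big) \log n \\
&> \log L_w(X_1^n),
\end{aligned}
\]

which is a contradiction. Here $L_{\tau_w^2}(X_1^n) = \prod_{w' \in \tau_w^2} L_{w'}(X_1^n)$. The first inequality 
follows from the assumption that $\tau^1 = \hat{\tau}_{\mathrm{BIC}}(X_1^n; c_1)$ and the second equality in 
(8.2). To derive the second inequality we use the fact that $0 < c_1 < c_2$ and 
$\mathrm{df}(w) - \sum_{w' \in \tau_w^2} \mathrm{df}(w') < 0$. Finally, the last inequality leading to the 
contradiction follows from $\tau^2 = \hat{\tau}_{\mathrm{BIC}}(X_1^n; c_2)$ and again the second equality in 
(8.2). We conclude that $\tau^1 \succeq \tau^2$. 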
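
The recursion defining $V_w^c$ and $\delta_w^c$ can be turned into a short program, which may help to follow the proof above. The sketch below is a minimal illustration and not the authors' code. It works with logarithms, it uses the simplifying assumption $\mathrm{df}(w) = |A| - 1$ for every $w$ (the paper's $\mathrm{df}(w)$ also accounts for forbidden transitions), and it only visits the strings $aw$ that occur in the sample. The names `count_transitions` and `ctm` are hypothetical.

```python
import math
from collections import Counter


def count_transitions(x, depth):
    """counts[(w, a)] = number of times the string w is followed by a in x."""
    counts = Counter()
    for i in range(depth, len(x)):
        for k in range(depth + 1):
            counts[(tuple(x[i - k:i]), x[i])] += 1
    return counts


def ctm(x, alphabet, depth, c):
    """Return (log V of the empty string, maximizing tree) for the constant c."""
    counts = count_transitions(x, depth)
    # Assumption of this sketch: the same penalty for every context.
    penalty = c * (len(alphabet) - 1) * math.log(len(x))

    def occurrences(w):
        return sum(counts[(w, a)] for a in alphabet)

    def log_lik(w):
        total = occurrences(w)
        return sum(counts[(w, a)] * math.log(counts[(w, a)] / total)
                   for a in alphabet if counts[(w, a)] > 0)

    def visit(w):
        own = log_lik(w) - penalty          # log of n^{-c df(w)} L_w
        if len(w) == depth:
            return own, [w]
        children = [(b,) + w for b in alphabet if occurrences((b,) + w) > 0]
        results = [visit(u) for u in children]
        split = sum(value for value, _ in results)  # log of the product of V_{aw}
        if split > own:                     # indicator delta_w equals 1
            return split, [u for _, leaves in results for u in leaves]
        return own, [w]                     # indicator delta_w equals 0

    return visit(())


# Alternating sample: the next symbol is determined by the last one.
x = [0, 1] * 100
_, fine = ctm(x, [0, 1], depth=2, c=0.1)
_, coarse = ctm(x, [0, 1], depth=2, c=1000.0)
print(fine, coarse)
```

On this alternating sample the small constant keeps the two contexts of length one, while the very large constant collapses the tree to the empty context, in agreement with the ordering stated in Lemma 8.1.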

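Proof of Theorem 3.5. To prove (1) let $\tau \in \mathcal{C}_n$ be such that $\tau \prec \tau^*$. Then 

\[
\begin{aligned}
\log L_\tau(X_1^n) - \log L_{\tau^*}(X_1^n)
&= \sum_{w' \in \tau,\, a \in A} N_n(w'a) \log \hat{p}_n(a|w') - \sum_{w \in \tau^*,\, a \in A} N_n(wa) \log \hat{p}_n(a|w) \\
&= \sum_{w' \in \tau} \ \sum_{w \in \tau^*,\, w \succ w'} \ \sum_{a \in A} N_n(wa) \log \frac{\hat{p}_n(a|w')}{\hat{p}_n(a|w)}.
\end{aligned}
\]

Dividing by $n$ and using Jensen's inequality in the right hand side we have that 

\[
\sum_{w' \in \tau} \ \sum_{w \in \tau^*,\, w \succ w'} \ \sum_{a \in A} \frac{N_n(wa)}{n} \log \frac{\hat{p}_n(a|w')}{\hat{p}_n(a|w)}
\longrightarrow
\sum_{w' \in \tau} \ \sum_{w \in \tau^*,\, w \succ w'} \ \sum_{a \in A} p^*(wa) \log \frac{p^*(a|w')}{p^*(a|w)} < 0
\]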

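as $n$ goes to $+\infty$ (by the minimality of $\tau^*$). Then, for a sufficiently large $n$ 
there exists a constant $c(\tau^*, \tau) > 0$ such that 

\[ \log L_{\tau^*}(X_1^n) - \log L_\tau(X_1^n) > c(\tau^*, \tau)\, n. \]

To prove (2) we have that 

\[
\begin{aligned}
\log L_{\tau'}(X_1^n) - \log L_\tau(X_1^n)
&= \sum_{w' \in \tau',\, a \in A} N_n(w'a) \log \hat{p}_n(a|w') - \sum_{w \in \tau,\, a \in A} N_n(wa) \log \hat{p}_n(a|w) \\
&\le \sum_{w' \in \tau',\, a \in A} N_n(w'a) \log \hat{p}_n(a|w') - \sum_{w \in \tau,\, a \in A} N_n(wa) \log p^*(a|w) \\
&= \sum_{w \in \tau} \ \sum_{w' \in \tau',\, w' \succ w} \ \sum_{a \in A} N_n(w'a) \log \frac{\hat{p}_n(a|w')}{p^*(a|w)} \\
&= \sum_{w \in \tau} \ \sum_{w' \in \tau',\, w' \succ w} N_n(w')\, D\big(\hat{p}_n(\cdot|w') \,\|\, p^*(\cdot|w)\big).
\end{aligned}
\]

By Lemmas 6.2 and 6.3 in Csiszár and Talata (2006) we have that, if $n$ is 
sufficiently large, we can bound above the last term by 

\[ \sum_{w \in \tau} \ \sum_{w' \in \tau',\, w' \succ w} N_n(w') \sum_{a \in A} \frac{[\hat{p}_n(a|w') - p^*(a|w)]^2}{p^*(a|w)}, \]

which is in turn bounded above by a constant multiple of $\log n$, the constant 
depending on $p^*_{\min} = \min_{w \in \tau^*,\, a \in A}\{p^*(a|w) : p^*(a|w) > 0\}$. This concludes the proof of 
Theorem 3.5. 

Proof of Theorem 3.6. It follows directly from Theorems 3.4 and 3.5. 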



Appendix 2 - Description of the encoded samples 

In this section we present more details concerning the sample of encoded texts. 
The sample consists of 40 articles from the 1994 and 1995 editions of the 
Brazilian newspaper Folha de São Paulo and 40 articles from the 1994 and 1995 
editions of the Portuguese newspaper O Público. 

The articles were randomly selected in the following way. We first randomly 
selected 20 editions for each newspaper for each year. Inside each edition we 
discarded all the texts with fewer than 1000 words as well as some types of 
articles (interviews, synopses, transcriptions of laws and collected works) 
which are unsuitable for our purposes. From the remaining articles we randomly 
selected one article for each previously selected edition. 

Before encoding, each one of the selected texts was submitted to a 
linguistically oriented cleaning procedure. Hyphenated compound words were 
rewritten as two separate words, except when one of the components is 
unstressed. Suspension points, question marks and exclamation points were 
replaced by periods. Dates and special symbols like "%" were spelled out as 
words. All parentheses were removed. 
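
The purely typographic part of this cleaning procedure can be sketched as follows. This is a hedged illustration only: the function name `normalize_punctuation` is hypothetical, the sketch removes the parenthesis characters but keeps the text they enclose (an assumption), and it spells out only the symbol "%" as "por cento". The treatment of hyphenated compounds and dates requires linguistic information and is not covered.

```python
import re


def normalize_punctuation(text):
    """Apply the typographic cleaning steps to a raw text (illustrative only)."""
    text = re.sub(r"[?!…]", ".", text)        # question/exclamation marks, ellipsis sign
    text = re.sub(r"\.{2,}", ".", text)       # suspension points and repeated periods
    text = re.sub(r"[()]", "", text)          # drop parentheses, keep their content
    text = text.replace("%", " por cento")    # spell out the percent symbol
    return re.sub(r"\s+", " ", text).strip()  # tidy up the spacing


print(normalize_punctuation("Sim?! Talvez... (ou não) 10%"))
```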

To use the smallest maximizer criterion we need to compute the number of 
degrees of freedom of each candidate context tree. To do this we must take 
into account the linguistic restrictions on the symbolic chain obtained after 
encoding. The restrictions are the following. 

1. Due to Portuguese morphological constraints, a stressed syllable (encoded 
by 1 or 3) can be immediately followed by at most three unstressed 
syllables (encoded by 0). 

2. Since by definition any prosodic word must contain one and only one 
stressed syllable (encoded by 1 or 3), after a symbol 3 no symbol 1 is 
allowed before a symbol 2 (unstressed syllable starting a prosodic word) 
appears. 

3. For the same reason, after a symbol 2 no symbols 2 or 3 are allowed before 
a symbol 1 appears. 

4. As sentences are formed by the concatenation of prosodic words, the only 
symbols allowed after 4 (end of sentence) are the symbols 2 or 3 (beginning 
of prosodic word). 
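
The four restrictions above can be checked mechanically on an encoded sequence. The sketch below is a literal reading of the list, written as an assumption-laden illustration rather than as the procedure used to count degrees of freedom: the function name `rule_violations` and the output format (pairs of position and rule number) are hypothetical.

```python
def rule_violations(seq):
    """Return a list of (position, rule number) for each violated restriction."""
    found = []
    zeros = None        # number of 0s since the last stressed syllable, if any
    wait_two = False    # True after a 3 until a 2 appears (rule 2)
    wait_one = False    # True after a 2 until a 1 appears (rule 3)
    prev = None
    for i, s in enumerate(seq):
        if prev == 4 and s not in (2, 3):
            found.append((i, 4))            # rule 4: only 2 or 3 after 4
        if s == 0:
            if zeros is not None:
                zeros += 1
                if zeros > 3:
                    found.append((i, 1))    # rule 1: at most three 0s after a stress
        else:
            zeros = 0 if s in (1, 3) else None
        if s == 1 and wait_two:
            found.append((i, 2))            # rule 2: no 1 after 3 before a 2
        if s in (2, 3) and wait_one:
            found.append((i, 3))            # rule 3: no 2 or 3 after 2 before a 1
        if s == 3:
            wait_two = True
        elif s == 2:
            wait_two = False
        if s == 2:
            wait_one = True
        elif s == 1:
            wait_one = False
        prev = s
    return found


print(rule_violations([2, 0, 1, 0, 3, 0, 2, 1, 4, 3]))  # no violation expected
print(rule_violations([3, 1]))                          # violates restriction 2
```

A transition that can never occur without triggering one of these checks does not contribute a free parameter, which is why the restrictions matter when counting the degrees of freedom of a candidate tree.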